Personalized next-best action recommendation with multi-party interaction learning for automated decision-making

Automated next-best action recommendation for each customer in a sequential, dynamic and interactive context has been widely needed in natural, social and business decision-making. Personalized next-best action recommendation must involve past, current and future customer demographics and circumstances (states) and behaviors, long-range sequential interactions between customers and decision-makers, multi-sequence interactions between states, behaviors and actions, and their reactions to their counterpart’s actions. No existing modeling theories and tools, including Markovian decision processes, user and behavior modeling, deep sequential modeling, and personalized sequential recommendation, can quantify such complex decision-making on a personal level. We take a data-driven approach to learn the next-best actions for personalized decision-making by a reinforced coupled recurrent neural network (CRN). CRN represents multiple coupled dynamic sequences of a customer’s historical and current states, responses to decision-makers’ actions, decision rewards to actions, and learns long-term multi-sequence interactions between parties (customer and decision-maker). Next-best actions are then recommended on each customer at a time point to change their state for an optimal decision-making objective. Our study demonstrates the potential of personalized deep learning of multi-sequence interactions and automated dynamic intervention for personalized decision-making in complex systems.


Introduction
In enterprise and complex problem-solving, automated and personalized decision-making is highly needed but rarely possible in practice. Personalized decision-making requires personalized next-best actions to be learned and used in a dynamic, sequential and interactive process and context, which is extremely demanding in both private and public sectors and in natural and social systems. Examples are next-best treatments to be made by healthcare providers on patients, next-best trading strategies to be taken by investors in a capital market, next-best interventions on cybersecurity attacks or climate change in real time, next-best communications between a bank and its clients, and any other services involving client-provider interactions. Personalized next-best action-taking sets a high standard for long-term dependent, dynamic, sequential and interactive personalized and automated decision-making in sophisticated and constrained real-life environments. However, automated decision-making with personalized next-best action recommendation is extremely challenging: (1) the circumstances and response behaviors of each target client (used interchangeably with customer) must be characterized and modeled as they evolve over time; (2) any decision action taken by a decision-maker on the client at a time step takes place in a sequential and interactive context, where both client responses and decision actions interact and co-evolve under client-specific circumstances and decision-policy constraints, forming multiple interactive and coupled sequences; (3) often multiple decision choices are available at a time step, and the best decision action needs to fit client states, expected decision goals and effect, and the underlying environment; (4) taking any decision action will further affect the state, action and environment at the next time step, and the cumulative effect from all prior steps also evolves along the sequential action-response interactions, which form long-term dependencies between multiple sequences that affect the next-best action selection; and (5) while each next-best action is to achieve an expected local goal and effect on the client, the sequence of next-best actions should generate the optimal global goal and effect.
In practice, domain-driven action rules are often generated and tuned by a group of domain experts to address the complexities in the aforementioned personalized next-best action-taking in complex enterprises and systems. This domain-driven action selection collectively considers and balances the relationships between service policies, constraints, client circumstances, business procedures, risk indicators, decision rules, and intervention strategies. Although handcrafted action rules may be effective for specific and static scenarios on a small scale, they are ad hoc and ineffective for wide and dynamic applications and for large-scale real-time decision-making. They lack a general and proactive capacity to tackle personalized, sequential and interactive decision-making and often result in issues such as a high false intervention rate, high missing rate, and low cost-effectiveness.
The advances in new-generation artificial intelligence and data science have made possible the automated selection and optimal recommendation of personalized next-best actions in the above complex decision-making settings. This, however, poses a significant challenge to existing decision-support systems and modeling methods, including sequential decision-making [1,2], sequential and personalized recommendation [3][4][5][6], and deep learning [7,8]. To the best of our knowledge, there are no existing theories or modeling methods capable of handling the aforementioned demand and challenges in an automated or semi-automated manner. Typical sequential decision-making methods [9][10][11][12][13] assume that decision-making falls in the Markov decision processes (MDPs), i.e., the next state only depends on the current state and action [1]. Other approaches involve all historical states such as by weighing their impact on current states [14][15][16]. They do not fit personalized decision-making that goes beyond Markovian [16][17][18], which involves complex interactions and couplings between clients and providers and their states, responses and actions [19][20][21][22][23][24], as well as their dynamics and adaptation to bi-party (or multi-party) interactions [25,26]. More recent work selectively represents historical interactions between clients and decision-makers using methods such as temporal logic-based models [27][28][29][30][31], and recurrent neural networks (RNNs) with memories [32,33]. However, they are ineffective for next-best action recommendation, since they either treat states and actions homogeneously, i.e., ignoring the differences between states and actions, or ignore their complex interactions and couplings, by taking a predefined action on a state without selecting the actions for the best fit between clients, states, actions, and contexts. 
In addition, personalized recommendation and sequential recommender systems (including next-item and next-basket recommendation) have emerged recently [3,4][34][35][36] to recommend particular or next products that users may prefer in the next context. The existing methods do not involve comprehensive user-product couplings and heterogeneities (i.e., non-IIDness of users and items [37]), dynamic user-product interactions, sequential actions and responses, or optimal decision effects. In addition, intensive research has been done on group decision-making and recommendation [38][39][40], which is outside the scope of this work.
Here, we introduce a computational approach: a reinforced coupled recurrent network (CRN) to model the intrinsic nature of recommending personalized next-best actions in the aforementioned complex decision-making settings. CRN integrates deep learning, reinforcement learning, behavior informatics and recommender systems to learn dynamic, sequential, interactive and personalized decision-making processes. First, CRN models client circumstances, states, behaviors, responses and decision-making actions by multi-dimensional sequential representations using recurrent neural networks. This captures and transforms the states and behaviors of clients and actions made by decision-makers and their evolution into computable vector representations. Second, CRN builds a coupled recurrent unit (CRU) to capture relevant historical behaviors and simultaneously learn the following sophisticated couplings and interactions between clients and decision-makers on the above learned sequential representations using two long-term memories and five control gates: (1) the long-term sequential dependencies between an action and its previous actions taken by a decision-maker, called action-action dependence, to reveal the influence and transition between a series of prior actions and the current action; (2) the long-term sequential dependencies between a response and its previous responses made by a client, called response-response dependence, to learn the influence and transition between previous responses and the current one; and (3) the long-term sequential dependencies between a current response and its corresponding previous actions, called action-response dependence, to model the influence and transition between previous sequential actions and the current response of a client. As a result, CRU captures, represents and memorizes a sequence of relevant interactions between a client and a decision-maker with their particular states and behaviors and their history.
Third, CRN combines the represented behaviors with the client's current state features and transforms them to a compact client state representation, which models client states and their transition. Lastly, CRN models the reward to candidate actions and learns the dependence between the current reward to actions and the next client state in a compact state representation to determine the next-best action tailored for the client to achieve the decision goal.
The CRN model was tested in a major Australian government agency for debt collection to recommend next-best intervention actions on specific debtors for tailored, active and efficient debt collection. CRN automatically recommends the next-best action tailored for each debtor at a particular time by incorporating the debtor's current state and historical records, the government's optional and constrained action sequences, and the reward to actions specified by their debt collectors (domain experts), which measures the effectiveness of an action on debt collection. In contrast to the related work that either assumes a Markovian property of sequential decision-making actions or has a limited computational capability in modeling complex contexts and interactions in personalized decision-making, our approach collectively involves and automatically learns sequences of decision actions, client behaviors and states, their interactions and transitions, the action-action, response-response and action-response dependencies, and the action effect (reward) on client responses in dynamic, sequential and interactive decision-making contexts at a client level.

Learning next-best actions
Assume a next-best action selection process (illustrated in Fig 1) involves a client and their demographics and states, a decision-maker and their actions taken on the client under certain policy constraints, the response (behaviors) of the client to the actions, and the reward that measures the effectiveness of an action on the client to achieve business goals at a time point. For example, in government services such as social welfare and taxation, when a client incurs a debt (called a debtor, i.e., a government client who owes money to the government), the government may take a series of actions to recover the debt in full or quickly. Although debt collection is a widely used yet sophisticated process, experienced debt collectors not only consider a debtor's circumstances, the government's service policies and constraints, business objectives, and the effect of particular actions, but also monitor a debtor's responses to the implemented actions before a new action is taken. Some collectors may quantify the rewards for applied actions to indicate their effectiveness in intervention. At present, such action-based debt collection is mainly driven by business assumptions and rules, i.e., debt collection rules which we also call domain-driven action rules for complex systems and decision-making [41]. Domain-driven action rules play an important role in active and personalized debt collection using the collectors' experience, understanding and belief of the debtors' circumstances and possible responses and judgment in matching actions with client profiles.
However, domain-driven action selection is often ad hoc, costly and unsuitable for complex enterprise decision-making. A debt collection action must be carefully chosen and applied on a debtor at a particular time point by considering the client's circumstances, the government's policies and service objectives, the previous actions already taken on the client, the debtor's responses, the potential response to an action, and the business impact of interventions (e.g., whether the debt will be collected faster, in a less costly manner etc.). The action selection process also needs to consider a debtor's evolving circumstances, which further change during the sequential interactions with the government. Consequently, debt collection often involves a sequence of constrained candidate actions and the interactions with debtors in dynamic contexts sequentially and interactively. In summary, smart debt collection must be tailored for each debtor and debt case, dynamic in terms of catering for evolving debtor circumstances and business environmental settings (i.e., states), interactive between debtors and debt collectors with their iterative communications over the collection process, and sequential with both preceding and successive actions and states considered.
We model the above debt collection problem illustrated in Fig 1 as personalized next-best action recommendation on each client in a dynamic, sequential, interactive and constrained decision-making process. This personalized next-best action recommendation involves client information, sequences of client and decision-maker behaviors, and interactions between clients and the decision-maker under certain contexts and constraints at each time point.
Without loss of generality, we assume a client c over time t can be described by a three-element tuple $C_t = \langle D_t, A_{t-1}, O_t \rangle$, where $D_t = \{d_i \mid i = 1, \cdots, n_d\}$ refers to a set of the client's relatively stable information $d_i$ (e.g., the demographics of the client); $n_d$ refers to the size of $D_t$; $A_{t-1} = \{a_i \mid i = 1, \cdots, t-1\}$ refers to a sequence of $t-1$ past actions sequentially assigned by the decision-maker to the client before the current time t, in which $a_i$ refers to the action assigned at the past time step i ($i \le t-1$); and $O_t = \{O_{t,i} \mid i = 1, \cdots, t\}$ refers to a sequence of client responses to the correspondingly assigned actions during the interaction time period, in which $O_{t,i} = \{o_j \mid j = 1, \cdots, n_r\}$ refers to a set of responses consisting of $o_j$ made by the client at time step i to action $a_{i-1}$ ($i \le t$) and $n_r$ is the size of the response set. $C_t$ thus jointly captures the client's circumstances, behaviors, prior decision-making actions taken on the client, and the client responses to the actions. Accordingly, $C_t$ forms a comprehensive representation of client states, which will be further used to model the interactions with the decision-maker and quantify the effect of next-best action candidates. Further, after taking action $a_i$, a reward value $r_{\langle C_i, a_i \rangle}$ measures the effectiveness of $a_i$ on the client's next responses $O_{t,i+1}$. A larger $r_{\langle C_i, a_i \rangle}$ indicates higher effectiveness. At the current time t, a subset of k actions $\hat{A}^*_t = \{a^j_t \mid j = 1, \cdots, k\}$ is selected as the next-best actions on the client from a candidate action set $A^*_t$ satisfying policy constraints to achieve the top-k highest rewards $\{r_{\langle C_t, a^j_t \rangle} \mid j = 1, \cdots, k\}$. In practice, k = 1 indicates that only the action associated with the highest reward is recommended, corresponding to the next-best action.
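To make the client description concrete, the following minimal Python sketch stores the three elements of $C_t$ and checks that the sequences line up (t response sets for t − 1 past actions). The class name `ClientDescription`, its field names, and the example values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClientDescription:
    """Illustrative container for C_t = <D_t, A_{t-1}, O_t> (all names are assumptions)."""

    demographics: Dict[str, float]   # D_t: relatively stable client information
    past_actions: List[str]          # A_{t-1}: actions a_1 .. a_{t-1}
    responses: List[List[float]]     # O_t: response sets O_{t,1} .. O_{t,t}

    def __post_init__(self) -> None:
        # O_t holds one response set per time step 1..t, while A_{t-1}
        # holds one action per time step 1..t-1.
        if len(self.responses) != len(self.past_actions) + 1:
            raise ValueError("expected t response sets for t-1 past actions")

    @property
    def t(self) -> int:
        """Current time step t, i.e. the number of observed response sets."""
        return len(self.responses)


# Hypothetical debtor with two past actions and three observed response sets.
client = ClientDescription(
    demographics={"age": 42.0, "debt_amount": 1500.0},
    past_actions=["send_letter", "phone_call"],
    responses=[[0.0], [0.0], [1.0]],
)
print(client.t)  # 3
```

The length check encodes only what the definition states: the response at step i answers the action at step i − 1, so the response sequence is one element longer than the action sequence.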
Building on reinforcement learning [12,42,43] for sequential and interactive decision-making, the next-best action corresponds to the decision action that leads to the highest reward per client state and achieves the decision goal, which is learned by an action-value function $r_\theta(\cdot,\cdot)$. We learn the action-value function $r_\theta(\cdot,\cdot): \mathcal{C} \times \mathcal{A} \to \hat{\mathcal{R}}$, which formulates the response's reward $r_{\langle C_i, a_i \rangle}$ of action $a_i$ ($a_i \in \mathcal{A}$) on the client's representation $C_i$ ($C_i \in \mathcal{C}$) at time step i, where $\mathcal{C}$, $\mathcal{A}$, and $\hat{\mathcal{R}}$ are the spaces of client descriptions, decision-making actions, and estimated rewards. Assuming $\mathcal{R}$ represents the space of a real reward, the personalized next-best actions $\{a^j_t \mid j = 1, \cdots, k\}$ at time t for client c satisfy the objective function in Eq (1), where $\mathrm{Div}(\cdot \| \cdot)$ is the divergence between the estimated reward space $\hat{\mathcal{R}}$ and the actual reward space $\mathcal{R}$, and $\theta$ refers to the parameters in the action-value function $r_\theta(\cdot,\cdot)$.
The above action-value function differs from the typical reinforcement learning settings and Markovian decision processes, where the action-value is modeled as $r_\theta(\cdot,\cdot): \mathcal{O} \times \mathcal{A} \to \mathcal{R}$, i.e., on the dependence between decision actions $a_t$ and client responses $O_{t,t}$ ($O_{t,t} \in \mathcal{O}$; $\mathcal{O}$ is the space of the client's responses), which only selects the action based on the current state but ignores the client's sequential behaviors in history. On the contrary, our action-value function captures the client circumstance $D_t$ and his sequences of response behaviors $O_t$ on decision-making actions $A_{t-1}$ using a comprehensive client description $C_t$ rather than client responses $O_{t,t}$ at each time step t. Our action-value function thus models the complex dependencies between states, between actions, and between states and actions in the sequential state-action-response-coupled sequences (Fig 1), which sufficiently represent past-to-present interactions between a client and his decision-maker during sequential and interactive decision-making processes.
We further adopt empirical error minimization to learn the action-value function $r_\theta(\cdot,\cdot)$ in Eq (1). For a group of $n_c$ clients at time step t, we collect information about historical sequences of decision actions, responses, and rewards of each client $c^{(j)}$, and define the objective function in Eq (2) to learn the action-value function capturing the long-term dependent interactions within the client group, where $l(\cdot,\cdot): \mathcal{R} \times \hat{\mathcal{R}} \to \mathbb{R}$ refers to a loss function that measures the difference between the real and estimated rewards, $C^{(j)}_i$ refers to the description of the j-th client at time step i, $a^{(j)}_i$ refers to the historical decision action on the j-th client at time step i, and $t^{(j)}$ refers to the maximal length of the historical sequence of the j-th client. Our model also captures the client's behaviors within function $r_\theta(\cdot,\cdot)$, which caters for personalized recommendation for each client $c^{(j)}$. Consequently, rather than only assuming the Markovian property between states, we model the long-term dependencies between client states, between decision actions, and between states and actions by jointly involving client circumstances, response behaviors to actions, and action constraints and rewards. In doing so, we capture the rich, personalized and evolving couplings and interactions in sequential, dynamic and interactive decision-making processes between individual clients in their group. After learning the action-value function $r_\theta(\cdot,\cdot)$, we further learn the personalized next-best actions $\hat{A}^*_t = \{a^j_t \mid j = 1, \cdots, k\}$ from the candidate action set $A^*_t$ by optimizing the objective function in Eq (3). For example, for the aforementioned debt collection, we model each debtor's state at time t by involving the debtor's demographics, debt amount and duration, historical debt collection actions applied by the government, and response behaviors, etc. to represent the debtor's current description $C_t$, and further collect optional and sequential debt collection actions $A^*_t$ considerable by the government. We aim to optimize the objective function in Eq (3) to obtain the next-best intervention

Modeling the process of personalized next-best action-oriented decision-making
We model the personalized next-best action-oriented decision-making process as a personalized next-best action recommender, as shown in Fig 2. The next-best action recommender achieves the objective defined in Eq (3) in terms of two main learning tasks: (1) learning the action-value function, and (2) selecting the next-best actions. The first task learns the action-value function $r_\theta(\cdot,\cdot)$, which is then used in the second task to evaluate the actions in the candidate set based on a client's behaviors and current state. Those actions with the top-k highest rewards are then recommended as the next-best actions. Learning the action-value function is achieved by learning the personalized client representation and the action reward prediction. The personalized client representation module represents each client $C_t$ in terms of the client's demographics, behaviors and current state as a vector $s_t$, which represents the client state at time t. The action reward prediction module then takes $s_t$ together with a selected candidate action $a_t$ and evaluates the action reward (i.e., effectiveness) in terms of the learned action-value function $r_\theta(\cdot,\cdot)$. Those actions with the top-k highest rewards are selected from the candidate action set and recommended as the next-best actions $\hat{A}^*_t$. This design enables the candidate action set to be dynamically updated, which fits dynamic and constrained decision-making environments, where decision actions are constrained by related policies and/or environmental settings. This approach is also more efficient than other approaches such as multi-class classification-based action recommendation, because it does not need to estimate the probabilities of all possible actions (such estimation is often inefficient and may generate meaningless results in practice).
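The two learning tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the squared-error loss, the function names `empirical_risk` and `recommend_top_k`, and the toy reward table are all hypothetical. The first function accumulates the loss between observed and estimated rewards over every client and time step; the second scores only the actions in the current candidate set and keeps the k highest.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

State = Sequence[float]
RewardFn = Callable[[State, Hashable], float]
History = List[Tuple[State, Hashable, float]]  # (state, action, observed reward)


def empirical_risk(reward_fn: RewardFn, histories: List[History]) -> float:
    """Task 1 objective: total loss between observed and estimated rewards.

    Squared error is an assumed choice of the loss l(., .).
    """
    total = 0.0
    for history in histories:                  # one history per client
        for state, action, reward in history:  # one triple per time step
            total += (reward - reward_fn(state, action)) ** 2
    return total


def recommend_top_k(reward_fn: RewardFn, state: State,
                    candidates: Sequence[Hashable], k: int = 1) -> List[Hashable]:
    """Task 2: rank the candidate actions by estimated reward and keep the top k."""
    scored = [(reward_fn(state, action), action) for action in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [action for _, action in scored[:k]]


# Toy stand-in for a learned action-value function (values are made up).
toy_rewards = {"letter": 0.2, "call": 0.7, "garnishee": 0.9}


def toy_reward_fn(state: State, action: Hashable) -> float:
    return toy_rewards[action] * state[0]


histories = [[((1.0,), "letter", 0.0), ((1.0,), "call", 1.0)]]
risk = empirical_risk(toy_reward_fn, histories)

# "garnishee" is left out of the candidate set, e.g. because policy forbids it.
best = recommend_top_k(toy_reward_fn, (1.0,), ["letter", "call"], k=1)
print(best)  # ['call']
```

Because the candidate set is an argument rather than a fixed output layer, it can change from one time step to the next without retraining, which is the property the text contrasts with multi-class classification.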

Personalized client representation by coupled recurrent networks
Fig 2. The framework for modeling the next-best action-oriented personalized decision-making. $C_t$ refers to the representation describing client c at time t, $s_t$ is the vector of the client's state representation, $a_t$ refers to an action selected from the candidate action set $A^*_t$, $\mathbf{a}_t$ refers to the vector representation of action $a_t$, and $\hat{A}^*_t$ is the set of recommended next-best actions. The recommender first embeds a client's demographics, behaviors and current state to a state vector $s_t$ by the personalized representation module (Fig 3), then feeds $s_t$ and $\mathbf{a}_t$ into the reward prediction module to evaluate the effectiveness of the action. The actions in the candidate set with the top-k highest rewards are then recommended as the next-best actions. https://doi.org/10.1371/journal.pone.0263010.g002

PLOS ONE

We represent each client description $C_t$ by a personalized client representation module. It captures the relatively stable client circumstances and the sequence of prior response behaviors to a sequence of corresponding past actions applied for decision-making up to time t. As a result, each client c is comprehensively yet compactly represented by a state vector $s_t$ at time t. This transforms a client's cumulative behaviors, current state and sensitivity to decision actions into a universal vector space. This personalized client representation of each client's past and current situation forms a universal yet tailored foundation to further determine different decision-making tasks on the client level and makes it benchmarkable for different clients with the same state representation. We thus can make personalized next-best action recommendation in this client representation space for each client.
Since a client's past behavior sequence reflects his personal responses and preferences to the actions taken by decision-makers in the past, different actions will be selected as the next-best ones to be taken on clients who share similar states to fit their respective preferences and achieve the best possible reward for each client. Our method reveals the cumulative action effectiveness and the sensitivity of a client to actions by learning the complex interactions between a client's responses and assigned actions. In addition, involving a client's personal information at each time point further explains the fitness between decision actions and client circumstances. For example, debtors with different demographics and family situations likely respond differently to the same debt collection action in a government debt recovery campaign. Our approach of integrating a client's historical behavior sequence and their current personal information captures comprehensive factors affecting decision-making and is much more powerful than Markovian process models and other relevant methods.
We learn the personalized client representation using a coupled recurrent network (CRN, Fig 3). Given a client tuple $C_t = \langle D_t, A_{t-1}, O_t \rangle$, the decision action $a_i \in A_{t-1}$ and the set of client responses $O_{t,i} \in O_t$ at each prior time step i are sequentially fed into the CRN. Initially, the client response's hidden state is extracted by a fully connected network from the client's relatively stable personal information. An embedding layer transforms actions described by categorical values (e.g., sending a message to a debtor) to numerical vectors. CRN embeds the client behaviors and personal information as a vector $s_{imp}$, which describes the hidden state of each client at time t in terms of a data-driven implicit feature, since $s_{imp}$ is purely generated based on the client's observable data and its characteristics by the deep network. We also extract domain-driven explicit features designed by domain experts to describe the explicit situations in the CRN and transform them to a vector $s_{exp}$. Lastly, a client's current state is represented by a vector $s_t$, which fuses the client's hidden state $s_{imp}$ and explicit state $s_{exp}$ through fully connected layers.
CRN captures the complex couplings and interactions within and between the sequences of client states and decision actions in history and models client historical behaviors and interactions with the decision-maker using a coupled recurrent unit (CRU, Fig 4). Similar to the gated recurrent unit (GRU) [44], CRU stores the historical information in its outputs. However, there are two outputs in CRU rather than one as in GRU, which correspond to actions and responses, respectively. Specifically, the historical sequences of actions and client's responses are stored in $a^*_{t-1}$ and $o^*_t$, respectively. CRU adopts two gates $r_o$ and $r_a$ to control the impact of historical response and action information on their current states, respectively. Meanwhile, gates $z_o$ and $z_a$ control the impact of current states on updating the memory of historical information. In addition, CRU has an interaction gate $r_i$ to capture the dependence between a decision action and a client response. With vector representation $\mathbf{o}_t$ of the client's response $O_{t,t}$ at time t and vector representation $\mathbf{a}_{t-1}$ of decision action $a_{t-1}$ at time t − 1, the variables in CRU are calculated as follows:

where $\sigma(\cdot)$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, $\odot$ refers to the Hadamard product, $\mathbf{1}_a$ and $\mathbf{1}_o$ are all-ones vectors of dimension $n_a \times 1$ and $n_o \times 1$, respectively, $W_{z_a}$, $W_{r_a}$, $W_i$, $W_a$, $U_{z_a}$, $U_{r_a}$, and $U_a$ are learnable matrices of dimension $n_a \times n_a$, $W_{z_o}$, $W_{r_o}$, $W_o$, $U_{z_o}$, $U_{r_o}$ and $U_o$ are learnable matrices of dimension $n_o \times n_o$, $U_i$ is a learnable matrix of dimension $n_a \times n_o$, $I_o$ is a learnable matrix of dimension $n_o \times n_a$, $n_o$ is the dimension of the response vector representation $\mathbf{o}$, and $n_a$ is the dimension of the action embedding $\mathbf{a}$.
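The CRU update equations are not reproduced in this text, so the NumPy sketch below is only one plausible GRU-style instantiation that is consistent with the gates and matrix dimensions listed above; it should not be read as the authors' equations. The function names, the parameter keys, the way the interaction gate $r_i$ modulates the action memory before it enters the response branch, and the random initialization are all assumptions.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def init_cru_params(n_a: int, n_o: int, seed: int = 0) -> dict:
    """Random parameters with the matrix shapes stated in the text."""
    rng = np.random.default_rng(seed)
    shapes = {}
    for name in ("W_za", "W_ra", "W_i", "W_a", "U_za", "U_ra", "U_a"):
        shapes[name] = (n_a, n_a)
    for name in ("W_zo", "W_ro", "W_o", "U_zo", "U_ro", "U_o"):
        shapes[name] = (n_o, n_o)
    shapes["U_i"] = (n_a, n_o)
    shapes["I_o"] = (n_o, n_a)
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}


def cru_step(p: dict, a: np.ndarray, o: np.ndarray,
             a_mem: np.ndarray, o_mem: np.ndarray):
    """One assumed CRU step: a = embedded action a_{t-1}, o = response vector o_t."""
    # Action branch: GRU-style update of the action memory (gates z_a, r_a).
    z_a = sigmoid(p["W_za"] @ a + p["U_za"] @ a_mem)
    r_a = sigmoid(p["W_ra"] @ a + p["U_ra"] @ a_mem)
    a_cand = np.tanh(p["W_a"] @ a + p["U_a"] @ (r_a * a_mem))
    a_new = (1.0 - z_a) * a_mem + z_a * a_cand

    # Interaction gate r_i: how much action history flows into the response branch.
    r_i = sigmoid(p["W_i"] @ a + p["U_i"] @ o_mem)

    # Response branch: GRU-style update (gates z_o, r_o) plus the gated action term.
    z_o = sigmoid(p["W_zo"] @ o + p["U_zo"] @ o_mem)
    r_o = sigmoid(p["W_ro"] @ o + p["U_ro"] @ o_mem)
    o_cand = np.tanh(p["W_o"] @ o + p["U_o"] @ (r_o * o_mem)
                     + p["I_o"] @ (r_i * a_new))
    o_new = (1.0 - z_o) * o_mem + z_o * o_cand
    return a_new, o_new


n_a, n_o = 4, 3
params = init_cru_params(n_a, n_o)
rng = np.random.default_rng(1)
a_mem, o_mem = np.zeros(n_a), np.zeros(n_o)
for _ in range(5):  # five interaction steps with random stand-in inputs
    a_mem, o_mem = cru_step(params, rng.normal(size=n_a), rng.normal(size=n_o),
                            a_mem, o_mem)
print(a_mem.shape, o_mem.shape)  # (4,) (3,)
```

In this sketch the action branch depends only on action history (action-action dependence), while the response branch sees both its own history and the gated action memory (response-response and action-response dependence), which mirrors the three couplings the text attributes to the CRU.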
As a result, each client is comprehensively represented in terms of his circumstances, past decision actions received, past responses to the actions, and domain-driven factors considered in the decision-making process. For all clients, a personalized representation (see an example in Fig 5) is learned for each of them. The learned representations differ from or are similar to each other, corresponding to the similarity between their demographics and responses to actions. This provides a universal, comprehensive and benchmarkable representation to further conduct personalized decision-making.

Reward prediction of next-best actions on client states
We further measure the reward of each decision action on a client state using a reward prediction module (Fig 6), which is built on a residual network. The above learned client state representation vector $s_t$ and an action $a^j_t$ selected from a set of candidate actions $A^*_t$ that satisfy decision-making policy constraints are input into the reward prediction module. The candidate action $a^j_t$ is first embedded through an action embedding layer (the same as the action embedding layer in the personalized client representation module) to $\mathbf{a}^j_t$. Further, this embedded action is concatenated with the client state representation vector $s_t$ as the input of the following three-layer residual network. The last layer of the residual network predicts the reward $r_\theta(C_t, a^j_t)$ of each input action $a^j_t$ corresponding to the target client state $C_t$. The residual network-based reward prediction module shows unique strengths in efficiently modeling large-scale sequential decision-making actions. First, reward prediction is efficient in processing a large number of states and actions since it learns a common reward prediction model for different clients. Given a client state representation, it efficiently predicts the reward values for different actions. Second, modeling complexity can be automatically controlled since the residual network structure is embedded with a potential bypass from low-level information to high-level information. When the input data involves hierarchical patterns, the high-level features will be learned for the final prediction. For data with simple patterns, the low-level features will make a direct contribution to the final prediction. This reduces over-fitting in reward prediction and enables personalized client representation to be well learned to capture heterogeneous client behaviors, which are embedded in a common space for further decision-making tasks.
The next-best action recommendation module assesses the learned reward $r_\theta(\cdot,\cdot)$ associated with each action in the candidate set $A^*_t$ for client c at time step t to judge the effectiveness of taking the action for decision-making. Those actions with the top-k (k is a hyperparameter to be determined by decision-makers) highest rewards are recommended as the next-best actions $\hat{A}^*_t \subseteq A^*_t$ for the client.
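The reward prediction module described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: two residual blocks followed by a linear output layer stand in for the three-layer residual network, and the ReLU activation, layer widths, random weights, and the names `init_reward_net` and `predict_reward` are hypothetical rather than taken from the paper.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def init_reward_net(dim: int, seed: int = 0) -> dict:
    """Random weights for two residual blocks and a linear output layer."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(dim, dim)), "b1": np.zeros(dim),
        "W2": rng.normal(scale=0.1, size=(dim, dim)), "b2": np.zeros(dim),
        "w_out": rng.normal(scale=0.1, size=dim), "b_out": 0.0,
    }


def predict_reward(net: dict, s_t: np.ndarray, a_emb: np.ndarray) -> float:
    """Estimate the reward of one embedded action for one client state."""
    h = np.concatenate([s_t, a_emb])          # state and action embedding, concatenated
    h = h + relu(net["W1"] @ h + net["b1"])   # residual block 1: input bypasses the layer
    h = h + relu(net["W2"] @ h + net["b2"])   # residual block 2
    return float(net["w_out"] @ h + net["b_out"])  # last layer outputs the reward


# Hypothetical sizes: 5-dimensional state, 3-dimensional action embedding.
rng = np.random.default_rng(2)
s_t = rng.normal(size=5)
action_embeddings = {name: rng.normal(size=3) for name in ("letter", "call", "visit")}
net = init_reward_net(dim=8)

# One shared model scores every candidate action for the same client state.
rewards = {name: predict_reward(net, s_t, emb) for name, emb in action_embeddings.items()}
print(sorted(rewards, key=rewards.get, reverse=True))
```

The skip connections are what the text calls the bypass from low-level to high-level information: if both residual branches output zero, the prediction reduces to a linear function of the concatenated input.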

Strategies to learn from hierarchical imbalanced action-response interactions
Real-life data often presents imbalanced distributions [45]. In our case study of five-year debt collection data, we find it highly imbalanced and hierarchical across the attributes, attribute values, domain-driven rewards, and reward levels (Table 1). With respect to the actions, their frequency distribution is extremely imbalanced, which we call action imbalance. Some commonly taken actions may appear thousands of times more often than other rarely taken actions. Regarding the client interactions, the counts of interactions between actions and clients are imbalanced, resulting in client interaction imbalance. For example, a small fraction of the client cohort may involve a large proportion of interactions. With regard to the reward of actions, most of the reward values given by domain experts to actions may be 0, leading to reward imbalance. Lastly, the action effectiveness is different, where a small number of actions are very strong and effective and thus always generate a high reward, resulting in action effectiveness imbalance. These hierarchical imbalanced distributions in actions, interactions, rewards and action effects bring a significant challenge to the personalized recommendation of next-best actions. Action imbalance makes the model sensitive to those actions with high frequency but insensitive to the rarely appearing actions. This is because the model parameters are trained predominantly by samples with high-frequency actions in the training phase if the imbalance is not catered for. The client interaction imbalance also affects the training of CRN. Since the sequence lengths of past client behaviors and decision actions are both short in most cases, it is difficult for CRN to effectively capture the long-term dependencies in those few but long historical sequences. Further, reward imbalance biases the model's reward predictions toward 0. This results in most prediction results being 0, hence the model cannot generate the next-best actions. In addition, the action effectiveness imbalance results in the model consistently selecting those highly effective actions (which are usually tough actions), so it tends to recommend tough actions at all times for all clients. However, such recommendations mostly violate government service policies and constraints. In addition, the various imbalances are mixed with each other in the action-response interaction sequences, further increasing modeling difficulty. Consequently, the imbalanced distributions at different aspects bring significant but different challenges to personalized next-best action modeling.
Accordingly, we propose several strategies to improve CRN training and tackle the challenges brought by the hierarchical imbalances in action-response interactions. The key idea behind these strategies is to introduce explicit knowledge to regulate the implicit learning of multiple sequences and their dependencies in CRN (see the section on personalized client representation). Specifically, the various imbalances are first statistically quantified; then, the resulting statistics are used to sample the training data, weight the importance of samples, and adjust the effect on the reward prediction loss. The respective strategies to tackle the imbalance in different aspects are as follows.
• Action imbalance: Setting the weight $w_i^c$ of client $c$ with action $a_i$ as a function of the action frequencies, where $f_{a_j}$ is the frequency of action $a_j$ and $m$ is the total number of actions. To reflect action imbalance in the loss function (Eq (3)), the loss value on the client is multiplied by $w_i^c$ for backward gradient propagation.
• Client interaction imbalance: Sampling the training data with probabilities $\{p_i \mid i = 1, \dots, n_c\}$ for all $n_c$ clients in each batch, where $p_i$ is the sampling probability of the $i$-th client and is calculated from the history lengths, with $l_j$ the length of historical information of the $j$-th client and $n_c$ the number of clients.
• Reward imbalance: Setting a weight $w_r$ for the reward $r_{\langle C_t, a_t \rangle}$ given to action $a_t$ on client $C_t$. The loss value (Eq (3)) of the client with reward $r_{\langle C_t, a_t \rangle}$ is multiplied by $w_r$, and only the top-$k$ largest loss values in a batch are selected for backward gradient propagation.
• Action effectiveness imbalance: Adjusting the reward $r_{\langle C_t, a_t \rangle}$ in training samples as a function of $t$, where $t$ is the time duration (i.e., the current time step) when action $a_t$ is assigned.
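
The first three strategies above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the formulas used in this work: the inverse-frequency form of the action weights and the length-proportional form of the sampling probabilities are assumed here, and all function names are hypothetical. Only the top-k loss selection follows the description directly.

```python
import random
from collections import Counter


def action_weights(actions):
    """Assumed form: inverse-frequency weights, scaled to average 1 over the m actions."""
    freq = Counter(actions)                      # f_a for each observed action
    m = len(freq)                                # total number of distinct actions
    inverse = {a: 1.0 / f for a, f in freq.items()}
    total = sum(inverse.values())
    return {a: m * v / total for a, v in inverse.items()}


def sampling_probs(history_lengths):
    """Assumed form: p_i proportional to the history length l_i of client i."""
    total = sum(history_lengths)
    return [length / total for length in history_lengths]


def top_k_losses(losses, k):
    """Keep the k largest per-sample losses and zero the rest, so that only
    the kept samples would contribute to backward gradient propagation."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    keep = set(ranked[:k])
    return [loss if i in keep else 0.0 for i, loss in enumerate(losses)]


# Toy data: a frequent action (A6) and a rare one (A2).
weights = action_weights(["A6"] * 8 + ["A2"] * 2)
probs = sampling_probs([1, 1, 2, 4])
batch = random.Random(0).choices(range(4), weights=probs, k=3)
kept = top_k_losses([0.1, 0.9, 0.3, 0.7], k=2)
```

With these assumed forms, the rare action receives a larger loss weight than the frequent one, and clients with longer histories are drawn more often in each batch.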

The pilot settings and characteristics
Backtesting of our personalized next-best action recommendation was conducted on five-year (2012-2017) debt collection data from a major Australian government agency. A subset of the five-year debt-related data from the government was used, which comprises 61,361 clients, 10 selected debt collection actions, and 66,126 client response-government action sequences in a total of 111,514 debt transactions. The data comprise attributes about client demographics and circumstances, the debt amount and duration at each time point associated with a debtor, a list of optional debt collection actions and their application policy constraints, a sequence of historical actions taken by the government on a debtor to recover the debt at each time point, the corresponding client response behavior to each debt collection action, and the time information associated with debt cases, responses and actions. In debt collection, those actions that are likely to bring about faster and greater debt recovery are deemed high-reward. Debt collection experts rate the reward associated with each action on the debtor population (rather than individual debtors). Accordingly, we categorize all optional actions into two categories: (1) the low-reward action group, where actions receive a reward less than 0.5, and (2) the high-reward action group, where actions receive a reward larger than 0.5. The corresponding reward distribution of the 10 selected debt collection actions (anonymized for privacy reasons) is shown in Table 1, where the distribution of actions and their rewards over the five years is highly imbalanced. The most frequent action is Action 6 (A6), which appeared 62,263 times, while the least frequent action is Action 2 (A2), which appeared only 390 times. The length distribution of historical action sequences on each debtor is also imbalanced. Only 50% of clients had an action sequence length larger than 4.
In addition, the domain-driven rewards given to these debt collection actions are also imbalanced: the mean reward of all actions is under 0.32, while the highest reward given to any action equals 1. These observations show the need for handling the hierarchical imbalances with the strategies proposed above.
We randomly split the data into training, validation and testing sets in proportions of 70%, 10% and 20%, respectively. Due to resource constraints in the pilot, the government only selected a proportion of debtors from the entire debtor pool to apply the intervention actions recommended by our method. For the 10% of samples with the highest reward predicted by our CRN model, we calculated the average domain-driven reward given by the debt collectors and reported it as our modeling performance. The debt collectors agreed that this result indicates what percentage of debt can be deducted on average if the debt intervention is based on the next-best actions recommended by our model. For privacy reasons, we cannot report the government information or any details about the debtors and debt collectors in the pilot, and we cannot directly report the average debt deduction percentage incurred by our recommendations in comparison to that driven by the government's rule-based action selection strategies. Instead, we report the reward lift and error reduction made by our model recommendations in comparison with the domain-driven debt collection rules.
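
The split and the top-10% evaluation described above can be sketched as follows. This is a minimal sketch rather than the pilot's evaluation code: the function names, the fixed random seed and the toy rewards are hypothetical, and how ties in predicted reward are broken is an assumption.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle and split into training, validation and testing sets (70%/10%/20%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def top_percent_avg_reward(predicted, domain_reward, top_percent=10):
    """Average domain-driven reward of the samples whose predicted reward
    falls in the top `top_percent` percent."""
    k = max(1, len(predicted) * top_percent // 100)
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    return sum(domain_reward[i] for i in ranked[:k]) / k


train_set, val_set, test_set = split_70_10_20(range(100))

# Toy example: 20 samples, so the top 10% are the 2 highest predictions.
predicted = [i / 20 for i in range(20)]
domain_reward = [0.0] * 18 + [0.5, 1.0]
score = top_percent_avg_reward(predicted, domain_reward)
```

In the toy example the two highest-ranked samples carry expert rewards of 1.0 and 0.5, so the score is their mean.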

Baseline methods
We test our CRN model against (1) domain-driven rules, i.e., the debt collection rules defined by the debt collection experts; (2) variants of three state-of-the-art deep models, modified to cater for next-best action recommendation: Google's wide-and-deep (WD) model and LSTM- and GRU-based RNNs; and (3) combinations of the wide-and-deep model with RNN strategies. Specifically, the domain-driven rules were those applied by the government, where debt collection actions were taken according to the government's debt collection policies and constraints defined by debt collection experts. This rule-based action-taking method reflects the best practice in the debt collection business, so we treat it as the baseline to evaluate the effectiveness and business impact of our model recommendations. Second, the WD model was shown to achieve state-of-the-art results in recommendation [46]. It reflects the performance of the state-of-the-art Markov decision process, and we revise it to learn decision rules based on the current state of a client. Third, the LSTM- and GRU-based RNNs are shown to be effective in learning long-term dependencies. We embed historical client states into LSTM and GRU to transform a non-Markovian decision process into a Markovian decision process. They serve as the performance benchmark of state-of-the-art Markovian decision process learning. Lastly, we combine the WD model with LSTM and GRU and take advantage of two advanced deep modeling mechanisms, residual network (Res) and multiple layers (Multi), to form the best possible non-Markovian decision process learners: WD_LSTM, WD_GRU, WD_Res_LSTM, WD_Multi_LSTM, WD_Res_GRU and WD_Multi_GRU. They reflect the best possible performance we may achieve by hybridizing the state-of-the-art achievements in deep learning.
We empirically evaluate the performance of the proposed personalized next-best action recommender CRN in terms of the following aspects: (1) ability, i.e., whether our model can predict accurate reward values; (2) business impact, i.e., whether the recommended next-best actions can lead to an estimated high reward for business in practice; and (3) scalability, i.e., whether CRN can handle a large amount of data.
In our experiments, CRN represents each client's demographic features (e.g., client type, address, and industry sector) to form the initial states of CRU. This addresses the cold-start problem in decision-making by assuming that clients with similar demographic features are likely to share similar behaviors. Our model uses the ReLU activation function [47] for nonlinear mapping and has a batch-normalization layer after all non-linear layers. All multi-layer perceptron (MLP) networks in our model have three layers. We train the CRN model using the Adam algorithm [48] with a batch size of 128.

Recommendation of next-best actions for each client
We applied the recommended next-best actions for five-year debt collection. As shown in Table 2, our reward prediction module achieves a total average reward lift (total_avg) of 2.1942 and an action average reward lift (action_avg) of 2.4954, in comparison with 2.1089 (total_avg) and 2.2049 (action_avg) by Google's best-performing WD model. This is a 4.04% and 13.18% improvement, respectively, in recommending 10 next-best actions that satisfy the policy constraints for debt collection. By applying a hierarchical imbalanced training strategy (discussed in the Method section) on the CRN for reward prediction, our method achieves a reward lift of 2.5569 (total_avg) and 3.4599 (action_avg), which is 21.24% and 56.92% better than the total average and action average reward lift made by the WD model.
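
The improvement percentages quoted above are relative differences with respect to the WD model. The short check below recomputes them from the reported lifts. The dictionary and function names are hypothetical, and the lift values are copied from the paragraph above.

```python
def relative_improvement(ours, baseline):
    """Percentage by which `ours` exceeds `baseline`."""
    return 100.0 * (ours - baseline) / baseline


# Reward lifts as reported in the text (Table 2).
WD = {"total_avg": 2.1089, "action_avg": 2.2049}
CRN = {"total_avg": 2.1942, "action_avg": 2.4954}
CRN_IMBALANCE = {"total_avg": 2.5569, "action_avg": 3.4599}

for name, lifts in (("CRN", CRN), ("CRN with imbalance training", CRN_IMBALANCE)):
    for metric, value in lifts.items():
        gain = relative_improvement(value, WD[metric])
        print(f"{name} {metric}: {gain:.2f}%")
```

Rounded to two decimals, the four values agree with the 4.04%, 13.18%, 21.24% and 56.92% stated in the text.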
In the pilot, those actions with an estimated reward larger than 0.5 were applied as interventions with their debtors for faster, less costly and greater debt collection. Against the domain-driven reward given by the debt collectors, we evaluate the precision of CRN-recommended actions as the percentage of domain-driven high-reward actions that CRN also predicts as high-reward ones. The results in Table 3 show that CRN achieves a lift of 2.6465 (total_avg) and 3.2799 (action_avg), which is 5.38% and 11.74% better than the best-performing WD model. CRN_IMB further improves action_avg to 3.3816, which is 15.20% better than the WD model. Our method largely improves the precision for those actions rarely applied in business (e.g., A2 which only appeared 100 times in five years), which are shown to be more effective for some debtors. We also evaluate CRN effectiveness w.r.t. the mean squared error (MSE, Table 4) of the recommendations of next-best actions for five-year debt collection, which measures the difference between the domain-driven reward given by debt collection experts and the reward predicted by CRN for each action in the 10 action candidates. CRN recommendations achieve the best overall MSE results, i.e., a total_avg of 0.0777 and an action_avg of 0.0613, and CRN makes a 3.24% and 7.26% improvement over the best-performing WD model in terms of total_avg and action_avg, respectively.
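
The two MSE aggregates might be computed along the following lines. The sketch assumes that total_avg pools the squared errors over all samples, while action_avg first averages within each action and then across actions. This reading of the two metric names is an assumption, and the function name and toy records are hypothetical.

```python
from collections import defaultdict


def mse_total_and_action_avg(records):
    """records: iterable of (action, expert_reward, predicted_reward) tuples.

    Returns (total_avg, action_avg) under the assumed definitions:
    total_avg  = mean squared error pooled over all samples,
    action_avg = mean of the per-action mean squared errors.
    """
    squared_errors = defaultdict(list)
    for action, expert, predicted in records:
        squared_errors[action].append((expert - predicted) ** 2)

    pooled = [e for errors in squared_errors.values() for e in errors]
    total_avg = sum(pooled) / len(pooled)

    per_action = [sum(errors) / len(errors) for errors in squared_errors.values()]
    action_avg = sum(per_action) / len(per_action)
    return total_avg, action_avg


# Toy records: a frequent action predicted perfectly, a rare one missed.
records = [("A6", 0.0, 0.0), ("A6", 0.0, 0.0), ("A6", 0.0, 0.0), ("A2", 1.0, 0.0)]
total_avg, action_avg = mse_total_and_action_avg(records)
```

Under this reading, action_avg gives rare actions the same weight as frequent ones, which is why the two aggregates can diverge on imbalanced data.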
We further test our CRN model to show that it can efficiently model large-scale client-decision-maker interactions, as shown in Fig 7. In our test environment, CRN converges within 20 epochs, and the mean computational cost of each epoch is around 2 minutes. These empirical results show that CRN can be applied to large-scale interaction data and problems.

Discussion
Personalized decision-making reflects a deep understanding of each customer's circumstances and precision interventions on the customer (client) for optimal objectives. This is challenging when dynamic, interactive and sequential decision-making processes are involved. In this work, personalized deep learning is proposed to learn and recommend next-best actions for each customer in such a context. We model the client-decision-maker interactions and their decision-making context, which relates to client circumstances and behaviors and to decision-maker actions and constraints. The proposed reinforced coupled recurrent network (CRN) provides a general neural multi-sequence interaction learning solution that formalizes multi-party interactions with real-life evolving, long-term dependent states and behaviors of customers and intervention actions by decision-makers for automated personalized decision-making. The CRN, incorporating coupled recurrent units (CRU), effectively and efficiently models dynamic, personalized and sequential decision-making and recommends next-best actions for each client. CRU (1) reveals the complex long-term dependencies between client states, between decision actions, and between client responses and decision-maker interventions, and (2) involves and determines the client and decision-maker's historical information relevant to their responses and actions. In this way, we are able to model the complex multi-sequence interactions and coupling relationships between customer states and behaviors and decision-maker's actions and constraints for dynamic and personalized decision-making. This involves characterizing and coupling the roles, relationships and dynamics of clients and decision-makers in past, present and future decision-making processes.
Our multi-sequence interaction learning method shows the potential of effectively modeling multi-aspect sequential, interactive and long-term dependencies, learning sequential historical information about a client's circumstances and sequential behavior responses to decision actions, and capturing the dynamic sequential interactions between client responses and decision actions. Our method, thus, goes beyond the usual way of assuming such decision-making processes as Markovian or convertible to Markovian [21,27-29,49], which often only captures short-term dependencies in a single sequence and incurs a high computational cost and a high rate of meaningless recommendations. Our method captures the above diverse multi-sequence-coupled and long-term dependencies while also controlling the computational cost. This explains why our method outperforms the wide-and-deep model, which utilizes the Markovian decision process.
We also show the potential of personalized decision-making by selecting actions for each client at each time point based on a deep representation of individual-level decision-making processes over time. CRN embeds the CRU-captured historical information and the current client state as a compact representation to learn decision rules, i.e., the dependencies between the current reward and historical states and actions. All of this happens in a personalized and optimal manner, i.e., it results in recommending next-best actions for each client at each time point according to the context at that time.
This study also goes beyond non-Markovian decision process-based decision-making modeling [15,16,50], which models historical information by assuming a non-Markovian process but overlooks the sequential and multi-party interactions between stakeholders and between their behaviors. More research is required to further explore hierarchical, heterogeneous, time-varying and role-dependent couplings and interactions between multiple parties, between their behavior sequences, and between customer preferences and decision-making expectations. Multi-party interaction processes and dynamics also involve other challenges to be modeled, e.g., the imbalance in action distribution, which may follow a Beta rather than a normal distribution in some applications, hierarchical dependencies from attribute values to objects (e.g., clients), and the heterogeneities between customers.
The recent advancements in RNNs with long short-term memory (LSTM) [51] and GRU have been widely applied to model sequential decision processes [33]. They learn a representation of historical states and use the representation to inform decision-making, capturing the dependencies between historical states and the current action. In contrast, our method additionally captures the long-term interactions between actions and states and between actions, and it also incorporates the historical behaviors of clients into the current client states.
In addition, the recent work on sequential recommendation (such as next-item, next-basket and next-song recommendation [4,35,36]) and interactive recommendation [52] also involves contextual information. These methods typically apply neural networks and the attention mechanism [53] to model contextual information related to the current object. They cannot make next-best action recommendations since they do not involve decision processes, the impact evaluation of next actions, or dynamic environments.
Further, interactive personalized decision-making needs to dynamically evaluate and optimize the reward of each decision action and recommend the next-best action in relation to a customer's current states, future rewards to actions, the customer's future responses, and decision objectives. We model the effect of each action on each client by considering a client's current context, past long-term behaviors, and decision feedback (effectiveness) on past actions measured by domain-driven rewards. This creates a way to incorporate domain knowledge, historical experience, and client- and action-specific circumstances into a real-life complex decision-making process and interaction learning.
Lastly, the pilot study on next-best actions for debt collection shows that modeling personalized, dynamic, sequential and interactive decision-making processes is often associated with diverse computational challenges. They include hierarchical imbalanced data distributions, multi-party interactions, and sequential, evolving, long-term and multi-sequence couplings and dependencies. Our neural interaction learning method paves a computational way to effectively and efficiently make personalized recommendations on next-best actions for a large number of clients in enterprise decision-making.

Conclusion
Multi-party interactions involve multiple coupled sequences, e.g., of each party's states, behaviors and contexts. Personalized decision-making needs to model not only these coupled sequences and the couplings both within and between them but also the couplings between parties, e.g., between a decision-maker and its clients. The automated learning of next-best actions to be taken on each customer at each time is essential for personalized and automated decision-making in any application involving customer services and communications. Learning personalized next-best actions has to further model the multi-party interactions between each customer and their decision-maker and learn heterogeneous dynamic multi-sequence couplings. These issues go beyond classic decision theories, Markovian decision process theories, and sequential modeling and recommendation. User modeling, sequential modeling, behavior informatics, recommender systems and personalized decision-making should be integrated to address the challenges and complexities in learning automated decision-making with personalized next-best action recommendation and in dynamic, interactive and evolving personalized decision-making processes.